Comparing Word Relatedness Measures Based on Google $n$-grams

Authors

  • Aminul Islam
  • Evangelos E. Milios
  • Vlado Keselj

Abstract

Estimating word relatedness is essential in natural language processing (NLP) and in many related areas. Corpus-based word relatedness has advantages over knowledge-based supervised measures. Many corpus-based measures in the literature cannot be compared with one another because each uses a different corpus. The purpose of this paper is to show how to evaluate different corpus-based measures of word relatedness by calculating them over a common corpus (i.e., the Google n-grams) and then assessing their performance against gold-standard relatedness datasets. As a starting point, we evaluate six of these measures, all re-implemented using the Google n-gram corpus as their only resource, by comparing their performance on five different datasets. We also show how a word relatedness measure based on a web search engine can be implemented using the Google n-gram corpus.
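
The evaluation loop described in the abstract (score word pairs from shared corpus counts, then compare the scores with human judgments) can be sketched roughly as follows. This is an illustration only: the abstract does not name the six measures, so pointwise mutual information (PMI) is used here as an assumed stand-in, and the counts, the corpus size, the gold scores, and all function names are hypothetical toy values rather than data from the paper or the Google n-gram corpus.

```python
import math

# Hypothetical toy counts standing in for Google n-gram frequencies.
UNIGRAM = {"car": 1000, "automobile": 200, "banana": 500, "fruit": 800}
COOCCUR = {("automobile", "car"): 50, ("banana", "fruit"): 60, ("banana", "car"): 1}
TOTAL = 1_000_000  # assumed corpus size in tokens


def cooccurrence(w1, w2):
    """Look up an order-independent co-occurrence count (0 if unseen)."""
    return COOCCUR.get(tuple(sorted((w1, w2))), 0)


def pmi(w1, w2):
    """PMI in bits; pairs that never co-occur are scored 0.0 by convention here."""
    joint = cooccurrence(w1, w2)
    if joint == 0:
        return 0.0
    return math.log2(joint * TOTAL / (UNIGRAM[w1] * UNIGRAM[w2]))


def ranks(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation, computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical gold-standard pairs with made-up human relatedness scores.
PAIRS = [("car", "automobile"), ("banana", "fruit"), ("car", "banana"), ("car", "fruit")]
GOLD = [3.9, 3.5, 0.4, 0.6]

scores = [pmi(a, b) for a, b in PAIRS]
rho = spearman(scores, GOLD)
print(f"Spearman correlation with gold standard: {rho:.3f}")
```

Because every measure would read from the same count tables, swapping `pmi` for another scoring function leaves the rest of the evaluation unchanged, which is the point of using a common corpus.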

Related articles

TrWP: Text Relatedness using Word and Phrase Relatedness

Text is composed of words and phrases. In the bag-of-words model, phrases in texts are split into words. This may discard the inner semantics of phrases, which in turn may give inconsistent relatedness scores between two texts. TrWP, the unsupervised text relatedness approach, combines both word and phrase relatedness. The word relatedness is computed using an existing unsupervised co-occurrence based...

A Computationally Efficient Measure for Word Semantic Relatedness Using Time Series

Measuring the semantic relatedness of words plays an important role in a wide range of natural language processing and information retrieval applications, such as full-text search, summarization, classification and clustering. In this paper, we propose an easy-to-implement and low-cost method for estimating the semantic relatedness of words. The proposed method is based on the utilization of words tempo...

WikiRelate! Computing Semantic Relatedness Using Wikipedia

Wikipedia provides a knowledge base for computing word relatedness in a more structured fashion than a search engine and with more coverage than WordNet. In this work we present experiments on using Wikipedia for computing semantic relatedness and compare it to WordNet on various benchmarking datasets. Existing relatedness measures perform better using Wikipedia than a baseline given by Google ...

Distributed Distributional Similarities of Google Books over the Centuries

This paper introduces a distributional thesaurus and sense clusters computed on the complete Google Syntactic N-grams, which are extracted from Google Books, a very large corpus of digitized books published between 1520 and 2008. We show that a thesaurus computed on such a large text basis leads to much better results than using smaller corpora like Wikipedia. We also provide distributional thes...

Inferring Selectional Preferences from Part-Of-Speech N-grams

We present the PONG method to compute selectional preferences using part-of-speech (POS) N-grams. From a corpus labeled with grammatical dependencies, PONG learns the distribution of word relations for each POS N-gram. From the much larger but unlabeled Google N-grams corpus, PONG learns the distribution of POS N-grams for a given pair of words. We derive the probability that one word has a giv...

Publication date: 2012